--- /dev/null
+=== How the Blkif Drivers Work ===
+Andrew Warfield
+andrew.warfield@cl.cam.ac.uk
+
+The intent of this is to explain at a fairly detailed level how the
+split device drivers work in Xen 1.3 (aka 2.0beta). The intended
+audience for this, I suppose, is anyone who intends to work with the
+existing blkif interfaces and wants something to help them get up to
+speed with the code in a hurry. Secondly, I hope to break out the
+general mechanisms that are used in the drivers, which are likely to
+be necessary to implement other driver interfaces.
+
+As a point of warning before starting, it is worth mentioning that I
+anticipate much of the specifics described here changing in the near
+future. There has been talk about making the blkif protocol
+a bit more efficient than it currently is. Keir's addition of grant
+tables will change the current remapping code that is used when shared
+pages are initially set up.
+
+Also, writing other control interface types will likely need support
+from Xend, which at the moment has a steep learning curve... this
+should be addressed in the future.
+
+For more information on the driver model as a whole, read the
+"Reconstructing I/O" technical report
+(http://www.cl.cam.ac.uk/Research/SRG/netos/papers/2004-xenngio.pdf).
+
+==== High-level structure of a split-driver interface ====
+
+Why would you want to write a split driver in the first place? As Xen
+is a virtual machine manager and focuses on isolation as an initial
+design principle, it is generally considered unwise to share physical
+access to devices across domains. The reasons for this are obvious:
+when device resources are shared, misbehaving code or hardware can
+result in the failure of all of the client applications. Moreover, as
+virtual machines in Xen are entire OSs, standard device drivers that
+they might use cannot have multiple instantiations for a single piece
+of hardware. In light of all this, the general approach in Xen is to
+give a single virtual machine hardware access to a device, and where
+other VMs want to share the device, export a higher-level interface to
+facilitate that sharing. If you don't want to share, that's fine.
+There are currently Xen users actively exploring running two
+completely isolated X-Servers on a Xen host, each with its own video
+card, keyboard, and mouse. In these situations, the guests need only
+be given physical access to the necessary devices and left to go on
+their own. However, for devices such as disks and network interfaces,
+where sharing is required, the split driver approach is a good
+solution.
+
+The structure is like this:
+
+{{{
+ +--------------------------+ +--------------------------+
+ | Domain 0 (privileged) | | Domain 1 (unprivileged) |
+ | | | |
+ | Xend ( Application ) | | |
+ | Blkif Backend Driver | | Blkif Frontend Driver |
+ | Physical Device Driver | | |
+ +--------------------------+ +--------------------------+
+ +--------------------------------------------------------+
+ | X E N |
+ +--------------------------------------------------------+
+}}}
+
+The Blkif driver is in two parts, which we refer to as the frontend
+(FE) and the backend (BE). Together, they serve to proxy device requests
+between the guest operating system in an unprivileged domain, and the
+physical device driver in the physical domain. An additional benefit
+to this approach is that the FE driver can provide a single interface
+for a whole class of physical devices. The blkif interface mounts
+IDE, SCSI, and our own VBD-structured disks, independent of the
+physical driver underneath. Moreover, supporting additional OSs only
+requires that a new FE driver be written to connect to the existing
+backend.
+
+==== Inter-Domain Communication Mechanisms ====
+
+===== Event Channels =====
+
+Before getting into the specifics of the block interface driver, it is
+worth discussing the mechanisms that are used to communicate between
+domains. Two mechanisms are used to allow the construction of
+high-performance drivers: event channels and shared-memory rings.
+
+Event channels are an asynchronous interdomain notification
+mechanism. Xen allows channels to be instantiated between two
+domains, and domains can request that a virtual irq be attached to
+notifications on a given channel. The result of this is that the
+frontend domain can send a notification on an event channel, resulting
+in an interrupt entry into the backend at a later time.
+
+The event channel between two domains is instantiated in the Xend code
+during driver startup (described later). Xend's channel.py
+(tools/python/xen/xend/server/channel.py) defines the function
+
+{{{
+def eventChannel(dom1, dom2):
+    return xc.evtchn_bind_interdomain(dom1=dom1, dom2=dom2)
+}}}
+
+which maps to xc_evtchn_bind_interdomain() in tools/libxc/xc_evtchn.c,
+which in turn generates a hypercall to Xen to patch the event channel
+between the domains. Only a privileged domain can request the
+creation of an event channel.
+
+Once the event channel is created in Xend, its ends are passed to both the
+front and backend domains over the control channel. The end that is
+passed to a domain is just an integer "port" uniquely identifying the
+event channel's local connection to that domain. An example of this
+setup code is in linux-2.6.x/drivers/xen/blkfront/blkfront.c in
+blkif_status_change, which receives several status change events as
+the driver starts up. It is passed an event channel end in a
+BLKIF_INTERFACE_STATUS_CONNECTED message, and patches it in like this:
+
+{{{
+    blkif_evtchn = status->evtchn;
+    blkif_irq    = bind_evtchn_to_irq(blkif_evtchn);
+    if ( (rc = request_irq(blkif_irq, blkif_int,
+                           SA_SAMPLE_RANDOM, "blkif", NULL)) )
+        printk(KERN_ALERT "blkfront request_irq failed (%ld)\n", rc);
+}}}
+
+This code associates a virtual irq with the event channel, and
+attaches the function blkif_int() as an interrupt handler for that
+irq. blkif_int() simply handles the notification and returns; it does
+not need to interact with the channel at all.
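+
+In outline, a handler like blkif_int() walks the shared ring from its
+private response-consumer index up to the shared response-producer
+index. A minimal sketch follows, using the frontend's blk_ring page
+and resp_cons index and the ring types shown below; process_response()
+is a hypothetical helper standing in for the real completion logic:
+
+{{{
+static irqreturn_t blkif_int(int irq, void *dev_id, struct pt_regs *regs)
+{
+    BLKIF_RING_IDX i, rp;
+
+    rp = blk_ring->resp_prod;
+    rmb(); /* make sure we see the responses published up to rp */
+
+    for ( i = resp_cons; i != rp; i++ )
+    {
+        blkif_response_t *resp = &blk_ring->ring[MASK_BLKIF_IDX(i)].resp;
+        process_response(resp); /* hypothetical: match resp->id, complete I/O */
+    }
+
+    resp_cons = i;
+    return IRQ_HANDLED;
+}
+}}}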
+
+An example of generating a notification can also be seen in blkfront.c:
+
+{{{
+static inline void flush_requests(void)
+{
+    DISABLE_SCATTERGATHER();
+    wmb(); /* Ensure that the backend can see the requests. */
+    blk_ring->req_prod = req_prod;
+ notify_via_evtchn(blkif_evtchn);
+}
+}}}
+
+notify_via_evtchn issues a hypercall to set the event waiting flag on
+the other domain's end of the channel.
+
+===== Communication Rings =====
+
+Event channels are strictly a notification mechanism between domains.
+To move large chunks of data back and forth, Xen allows domains to
+share pages of memory. We use communication rings as a means of
+managing access to a shared memory page for message passing between
+domains. These rings are not explicitly a mechanism of Xen, which is
+only concerned with the actual sharing of the page and not with how it
+is used. They are, however, worth discussing, as they are used in many
+places in the current code and are a useful model for communicating
+across a shared page.
+
+A shared page is set up by a guest first allocating and passing the
+address of a page in its own address space to the backend driver.
+
+{{{
+    blk_ring = (blkif_ring_t *)__get_free_page(GFP_KERNEL);
+    blk_ring->req_prod = blk_ring->resp_prod = resp_cons = req_prod = 0;
+    ...
+    /* Construct an interface-CONNECT message for the domain controller. */
+    cmsg.type      = CMSG_BLKIF_FE;
+    cmsg.subtype   = CMSG_BLKIF_FE_INTERFACE_CONNECT;
+    cmsg.length    = sizeof(blkif_fe_interface_connect_t);
+    up.handle      = 0;
+    up.shmem_frame = virt_to_machine(blk_ring) >> PAGE_SHIFT;
+    memcpy(cmsg.msg, &up, sizeof(up));
+}}}
+
+
+blk_ring will be the shared page. The producer and consumer pointers
+are then initialised (these will be discussed soon), and the
+machine address of the page is sent to the backend via a control
+channel to Xend. This control channel itself uses the notification
+and shared memory mechanisms described here, but is set up for each
+domain automatically at startup.
+
+The backend, which is a privileged domain, then takes the page address
+and maps it into its own address space (in
+linux26/drivers/xen/blkback/interface.c:blkif_connect()):
+
+{{{
+void blkif_connect(blkif_be_connect_t *connect)
+{
+
+ ...
+ unsigned long shmem_frame = connect->shmem_frame;
+ ...
+
+ if ( (vma = get_vm_area(PAGE_SIZE, VM_IOREMAP)) == NULL )
+ {
+ connect->status = BLKIF_BE_STATUS_OUT_OF_MEMORY;
+ return;
+ }
+
+ prot = __pgprot(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED);
+ error = direct_remap_area_pages(&init_mm, VMALLOC_VMADDR(vma->addr),
+ shmem_frame<<PAGE_SHIFT, PAGE_SIZE,
+ prot, domid);
+
+ ...
+
+    blkif->blk_ring_base = (blkif_ring_t *)vma->addr;
+}
+}}}
+
+The machine address of the page is passed in the shmem_frame field of
+the connect message. This is then mapped into the virtual address
+space of the backend domain, and saved in the blkif structure
+representing this particular backend connection.
+
+NOTE: New mechanisms will be added very shortly to allow domains to
+explicitly grant access to their pages to other domains. This "grant
+table" support is in the process of being added to the tree, and will
+change the way a shared page is set up. In particular, it will remove
+the need for the remapping domain to be privileged.
+
+Sending data across shared rings:
+
+Shared rings avoid the potential for write interference between
+domains in a very cunning way. A ring is partitioned into a request
+and a response region, and domains only work within their own space.
+This can be thought of as a double producer-consumer ring -- the ring
+is described by four pointers into a circular buffer of fixed-size
+records. Pointers may only advance, and may not pass one another.
+
+
+{{{
+ rsp_cons----+
+ V
+ +----+----+----+----+----+----+----+
+ | | | free |RSP1|RSP2|
+ +----+----+----+----+----+----+----+
+ req_prod->| | --------> |RSP3|
+ +----+ +----+
+ |REQ8| | |<-rsp_prod
+ +----+ +----+
+ |REQ7| | |
+ +----+ +----+
+ |REQ6| <-------- | |
+ +----+----+----+----+----+----+----+
+ |REQ5|REQ4| free | | |
+ +----+----+----+----+----+----+----+
+ req_cons---------^
+}}}
+
+
+By adopting the convention that every request will receive a response,
+not all four pointers need be shared and flow control on the ring
+becomes very easy to manage. Each domain manages its own
+consumer pointer, and the two producer pointers are visible to both (Xen/include/hypervisor-ifs/io/blkif.h):
+
+
+{{{
+/* NB. Ring size must be small enough for sizeof(blkif_ring_t) <= PAGE_SIZE. */
+#define BLKIF_RING_SIZE 64
+
+ ...
+
+/*
+ * We use a special capitalised type name because it is _essential_ that all
+ * arithmetic on indexes is done on an integer type of the correct size.
+ */
+typedef u32 BLKIF_RING_IDX;
+
+/*
+ * Ring indexes are 'free running'. That is, they are not stored modulo the
+ * size of the ring buffer. The following macro converts a free-running counter
+ * into a value that can directly index a ring-buffer array.
+ */
+#define MASK_BLKIF_IDX(_i) ((_i)&(BLKIF_RING_SIZE-1))
+
+typedef struct {
+ BLKIF_RING_IDX req_prod; /* 0: Request producer. Updated by front-end. */
+ BLKIF_RING_IDX resp_prod; /* 4: Response producer. Updated by back-end. */
+ union { /* 8 */
+ blkif_request_t req;
+ blkif_response_t resp;
+ } PACKED ring[BLKIF_RING_SIZE];
+} PACKED blkif_ring_t;
+}}}
+
+
+As shown in the diagram above, the rules for using a shared memory
+ring are simple.
+
+ 1. A ring is full when a domain's producer and consumer pointers are
+    equal (e.g. req_prod == resp_cons). In this situation, the
+    consumer pointer must be advanced. Furthermore, if the consumer
+    pointer is equal to the other domain's producer pointer
+    (e.g. resp_cons == resp_prod), then the other domain has all the
+    buffers.
+
+ 2. Producer pointers point to the next buffer that will be written to.
+    (So blk_ring[MASK_BLKIF_IDX(req_prod)] should not be consumed.)
+
+ 3. Consumer pointers point to a valid message, so long as they are not
+    equal to the associated producer pointer.
+
+ 4. A domain should only ever write to the message pointed to by its
+    producer index, and read from the message at its consumer. More
+    generally, the domain may be thought of as having exclusive access
+    to the messages between its consumer and producer, and should
+    absolutely not read or write outside this region.
+
+In general, drivers keep a private copy of their producer pointer and
+then set the shared version when they are ready for the other end to
+process a set of messages. Additionally, it is worth paying attention
+to the use of memory barriers (rmb/wmb) in the code, to ensure that
+rings that are shared across processors behave as expected.
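+
+Putting these rules together, a sketch of the frontend's producer side
+might look like the following. send_request() is a hypothetical
+helper; the real driver batches several requests and then calls
+flush_requests(), shown earlier. Note that with free-running counters,
+"full" is detected as the index difference reaching the ring size:
+
+{{{
+static int send_request(blkif_request_t *req)
+{
+    /* Flow control: every request gets a response, so the slots we
+     * own number req_prod - resp_cons (both frontend-private). */
+    if ( req_prod - resp_cons == BLKIF_RING_SIZE )
+        return -1; /* ring is full: wait for responses to arrive */
+
+    /* Rule 4: only ever write at our own producer index. */
+    blk_ring->ring[MASK_BLKIF_IDX(req_prod)].req = *req;
+    req_prod++;
+
+    /* Publish the private index, then notify the backend. */
+    wmb(); /* requests must be visible before the index update */
+    blk_ring->req_prod = req_prod;
+    notify_via_evtchn(blkif_evtchn);
+    return 0;
+}
+}}}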
+
+==== Structure of the Blkif Drivers ====
+
+Now that the communications primitives have been discussed, I'll
+quickly cover the general structure of the blkif driver. This is
+intended to give a high-level idea of what is going on, in an effort
+to make reading the code a more approachable task.
+
+There are three key software components involved in the blkif drivers
+(not counting Xen itself): the frontend driver, the backend driver,
+and Xend, which coordinates their initial connection. Xend may also
+be involved in control-channel signalling in some cases after startup,
+for instance to manage reconnection if the backend is restarted.
+
+===== Frontend Driver Structure =====
+
+The frontend domain uses a single event channel and a shared memory
+ring to trade control messages with the backend. These are both set
+up during domain startup, which will be discussed shortly. The shared
+memory ring is called blkif_ring, and the private ring indexes are
+resp_cons and req_prod. The ring is protected by blkif_io_lock.
+Additionally, the frontend keeps a list of outstanding requests in
+rec_ring[]. These are uniquely identified by a guest-local id number,
+which is associated with each request sent to the backend, and
+returned with the matching responses. Information about the actual
+disks are stored in major_info[], of which only the first nr_vbds
+entries are valid. Finally, the global 'recovery' indicates that the
+connection between the backend and frontend drivers has been broken
+(possibly due to a backend driver crash) and that the frontend is in
+recovery mode, in which case it will attempt to reconnect and reissue
+outstanding requests.
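+
+The id mechanism might be sketched as follows (get_free_rec_slot(),
+put_free_rec_slot(), and finish_io() are hypothetical helpers; the
+real driver manages free slots within rec_ring[] itself):
+
+{{{
+/* On submission: stash a private copy of the request, keyed by id. */
+static unsigned long record_request(blkif_request_t *req)
+{
+    unsigned long id = get_free_rec_slot();
+    req->id = id;         /* the backend echoes this id in its response */
+    rec_ring[id] = *req;  /* private copy, reissued if recovery is needed */
+    return id;
+}
+
+/* On response: the echoed id locates the original request. */
+static void complete_request(blkif_response_t *resp)
+{
+    unsigned long id = resp->id;
+    finish_io(&rec_ring[id], resp->status);  /* complete the guest I/O */
+    put_free_rec_slot(id);
+}
+}}}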
+
+The frontend driver is single-threaded and after setup is entered only
+through three points: (1) read/write requests from the XenLinux guest
+that it is a part of, (2) interrupts from the backend driver on its
+event channel (blkif_int()), and (3) control messages from Xend
+(blkif_ctrlif_rx).
+
+===== Backend Driver Structure =====
+
+The backend driver is slightly more complex as it must manage any
+number of concurrent frontend connections. For each domain it
+manages, the backend driver maintains a blkif structure, which
+describes all the connection and disk information associated with that
+particular domain. This structure is associated with the interrupt
+registration, and allows the backend driver to have immediate context
+when it takes a notification from some domain.
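+
+This is the standard Linux dev_id idiom: the blkif pointer is passed
+to request_irq() when the connection is set up, and comes back as the
+handler's context argument on every notification. A sketch, assuming
+the 2.6-era handler signature and illustrative field names:
+
+{{{
+/* On notification, dev_id identifies the signalling domain's blkif. */
+static irqreturn_t blkif_be_int(int irq, void *dev_id, struct pt_regs *regs)
+{
+    blkif_t *blkif = (blkif_t *)dev_id;
+    add_to_blkdev_list_tail(blkif);  /* queue for the scheduler thread */
+    maybe_trigger_blkio_schedule();  /* both are discussed below */
+    return IRQ_HANDLED;
+}
+
+/* At connect time, the per-domain blkif is the dev_id cookie. */
+static void connect_irq(blkif_t *blkif)
+{
+    blkif->irq = bind_evtchn_to_irq(blkif->evtchn);
+    request_irq(blkif->irq, blkif_be_int, 0, "blkif-backend", blkif);
+}
+}}}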
+
+All of the blkif structures are stored in a hash table (blkif_hash),
+which is indexed by a hash of the domain id and a "handle" -- really a
+per-domain blkif identifier, allowing a domain to have multiple
+connections.
+
+The per-connection blkif structure is of type blkif_t. It contains
+all of the communication details (event channel, irq, shared memory
+ring and indexes), and blk_ring_lock, which is the backend mutex on
+the shared ring. The structure also contains vbd_rb, which is a
+red-black tree, containing an entry for each device/partition that is
+assigned to that domain. This structure is filled by xend passing
+disk information to the backend at startup, and is protected by
+vbd_lock. Finally, the blkif struct contains a status field, which
+describes the state of the connection.
+
+The backend driver spawns a kernel thread at startup
+(blkio_schedule()), which handles requests to and from the actual disk
+device drivers. This scheduler thread maintains a list of blkif
+structures that have pending requests, and services them round-robin
+with a maximum per-round request limit. blkifs are added to the list
+in the interrupt handler (blkif_be_int()) using
+add_to_blkdev_list_tail(), and removed in the scheduler loop after
+calling do_block_io_op(), which processes a batch of requests. The
+scheduler thread is explicitly activated at several points in the code
+using maybe_trigger_blkio_schedule().
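+
+In outline, the scheduler thread's main loop looks something like the
+following. This is a simplified sketch: next_pending_blkif() and
+MAX_PER_ROUND are illustrative names, the wait-queue and list names
+likewise, and the descheduling and exit logic are omitted:
+
+{{{
+static int blkio_schedule(void *arg)
+{
+    blkif_t *blkif;
+
+    for ( ; ; )
+    {
+        /* Sleep until some blkif has queued work. */
+        wait_event_interruptible(blkio_schedule_wait,
+                                 !list_empty(&blkio_schedule_list));
+
+        /* Round-robin: service a bounded batch of requests from each
+         * pending interface before moving on to the next. */
+        while ( (blkif = next_pending_blkif()) != NULL )
+            do_block_io_op(blkif, MAX_PER_ROUND);
+    }
+    return 0; /* not reached */
+}
+}}}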
+
+Pending requests between the backend driver and the physical device
+drivers use another ring, pending_ring. Requests are placed in this
+ring in the scheduler thread and issued to the device. A completion
+callback, end_block_io_op, indicates that requests have been serviced
+and generates a response on the appropriate blkif ring. pending_reqs[]
+stores a list of outstanding requests with the physical drivers.
+
+So, control entries to the backend are (1) the blkio scheduler thread,
+which sends requests to the real device drivers, (2) end_block_io_op,
+which is called as serviced requests complete, (3) blkif_be_int()
+handles notifications from the frontend drivers in other domains, and
+(4) blkif_ctrlif_rx() handles control messages from xend.
+
+==== Driver Startup ====
+
+Prior to starting a new guest using the frontend driver, the backend
+will have been started in a privileged domain. The backend
+initialisation code initialises all of its data structures, such as
+the blkif hash table, and starts the scheduler thread as a kernel
+thread. It then sends a driver status "up" message to let xend know
+it is ready to take frontend connections.
+
+When a new domain that uses the blkif frontend driver is started,
+there are a series of interactions between it, xend, and the specified
+backend driver. These interactions are as follows:
+
+The domain configuration given to xend will specify the backend domain
+and disks that the new guest is to use. Prior to actually running the
+domain, xend and the backend driver interact to set up the initial
+blkif record in the backend.
+
+(1) Xend sends a BLKIF_BE_CREATE message to the backend.
+
+ Backend does blkif_create(), having been passed FE domid and handle.
+ It creates and initialises a new blkif struct, and puts it in the
+ hash table.
+ It then returns a STATUS_OK response to xend.
+
+(2) Xend sends a BLKIF_BE_VBD_CREATE message to the backend.
+
+ Backend adds a vbd entry in the red-black tree for the
+ specified (dom, handle) blkif entry.
+ Sends a STATUS_OK response.
+
+(3) Xend sends a BLKIF_BE_VBD_GROW message to the backend.
+
+    Backend takes the physical device information passed in the
+    message and assigns it to the newly created vbd struct.
+
+(2) and (3) repeat as any additional devices are added to the domain.
+
+At this point, the backend has enough state to allow the frontend
+domain to start. The domain is run, and eventually gets to the
+frontend driver initialisation code. After setting up the frontend
+data structures, this code continues the communications with xend and
+the backend to negotiate a connection:
+
+(4) Frontend sends Xend a BLKIF_FE_DRIVER_STATUS_CHANGED message.
+
+ This message tells xend that the driver is up. The init function
+ now spin-waits until driver setup is complete in order to prevent
+ Linux from attempting to boot before the disks are connected.
+
+(5) Xend sends the frontend an INTERFACE_STATUS_CHANGED message
+
+ This message specifies that the interface is now disconnected
+ (instead of closed).
+    The domain updates its state, and allocates the shared blk_ring
+ page. Next,
+
+(6) Frontend sends Xend a BLKIF_INTERFACE_CONNECT message
+
+ This message specifies the domain and handle, and includes the
+ address of the newly created page.
+
+(7) Xend sends the backend a BLKIF_BE_CONNECT message
+
+ The backend fills in the blkif connection information, maps the
+ shared page, and binds an irq to the event channel.
+
+(8) Xend sends the frontend an INTERFACE_STATUS_CHANGED message
+
+ This message takes the frontend driver to a CONNECTED state, at
+ which point it binds an irq to the event channel and calls
+ xlvbd_init to initialise the individual block devices.
+
+The frontend Linux is still spin-waiting at this point, until all of
+the disks have been probed. Messaging now happens directly between
+the front and backend domains using the new shared ring and event
+channel.
+
+(9) The frontend sends a BLKIF_OP_PROBE directly to the backend.
+
+    This message includes a reference to an additional page that the
+    backend can use for its reply. The backend responds with an array
+    of the domain's disks (as vdisk_t structs) on the provided page.
+
+The frontend now initialises each disk, calling xlvbd_init_device()
+for each one.